Augmented Sparse Principal Component Analysis for High 

Dimensional Data0 

Debashis Paul^ and Iain M. Johnstone"'" 

f University of California, Davis 
'^Stanford University 

Abstract 

. We study the problem of estimating the leading eigenvectors of a high-dimensional pop- 

ulation covariance matrix based on independent Gaussian observations. We establish lower 
■ bounds on the rates of convergence of the estimators of the leading eigenvectors under 

' Z'^-sparsity constraints -when an P loss function is used. We also propose an estimator of 

<^ . the leading eigenvectors based on a coordinate selection scheme combined "with PGA and 

' sho"w that the proposed estimator achieves the optimal rate of convergence under a sparsity 

1^ . regime. Moreover, we establish that under certain scenarios, the usual PGA achieves the 

' minimax convergence rate. 
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1 Introduction 



Principal components analysis (PCA) has been a "widely used technique in reducing dimen- 
sionality of multivariate data. A traditional setting "where PCA is applicable is "when one has 
repeated observations from a multivariate population that can be described reasonably well 
by its first t"wo moments. When the dimension of sample observations, is fixed, distributional 
properties of the eigenvalues and eigenvectors of the sample covariance have been dealt with 
at length by various authors. Anderson (1963), Muirhead (1982) and Tyler (1983) are among 
. standard references. Much of the "large sample" study of the eigen-structure of the sample 

covariance matrix is based on the fact that, sample covariance approximates population co- 
variance matrix well when sample size is large. However, due to advances in data acquisition 
■ technologies, statistical problems, where the dimensionality of individuals are of nearly the same 

psj i order of magnitude as (or even bigger than) the sample size, are increasingly common. The 

following is a representative list of areas and articles where PCA has been in use. In all these 
cases denotes the dimension of an observation and n denotes the sample size. 



• Image recognition : The face recognition problem is to identify a face from a collection of 
I faces. Here each observation is a digitized image of the face of a person. So typically, with 

128 X 128 pixel grids, one has to deal with a situation where ~ 1.6 x 10^. Whereas, a 
standard image database, e.g. that of students of Brown University Wickerhauser (1994), 
may contain only a few hundred pictures. 

• Shape analysis : Stegmann and Gomez (2002), Cootes, Edwards and Taylor (2001) outline 
a class of methods for analyzing the shape of an object based on repeated measurements 



^This manuscript was written in 2007, and a version dated December 11, 2007 has been available on the first 
author's website at http : //anson.ucdavis . edu/'^debashis/techrep/augented-spca.pdf . But it is posted to 
arXiv now in its 2007 form as that version has since been cited in work by many authors. Revisions incorporating 
later work will be posted separately. 
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that involves annotating the objects for landmarks. These landmarks act as features of 
the objects, and hence, can be thought of as the dimension of the observations. For a 
specific example relating to motion of hand Stegmann and Gomez (2002) , the number of 
landmarks is 56 and sample size is 40. 

• Chemometrics : In many chemometric studies, sometimes the data consists of several 
thousands of spectra measured at several hundred wavelength positions, e.g. data col- 
lected for calibration of spectrometers. Vogt, Dable, Cramer and Booksh (2004) give an 
overview of some of these applications. 

• Econometrics : Large factor analysis models are often used in econometric studies, e.g. in 
dealing with hundreds of stock prices as a multivariate time series. Markowitz's theory of 
optimal portfolios ask this question. Given a set of financial assets characterized by their 
average return and risk, what is the optimal weight of each asset, such that the overall 
portfolio provides the best return? Laloux, Cizeau, Bouchaud and Potters (2000) discuss 
several applications. Bai (2003) considers some inferential aspects. 

• Climate studies : Measurements on atmospheric indicators, like ozone concentration etc. 
are taken at a number of monitoring stations over a number of time points. In this litera- 
ture, principal components are commonly referred to as "empirical orthogonal functions" . 
Preisendorfer (1988) gives a detailed treatment. EOFs are also used for model diagnostics 
and data summary Cassou, Deser, Terraty, Hurrell and Drevillon (2004). 

• Communication theory : Tulino and Verdu (2004) give an extensive treatment to the 
connection between random matrix theory and vector channels used in wireless commu- 
nications. 

• Functional data analysis : Since observations are curves, which are typically measured 
at a large number of points, the data is high dimensional. Buja, Hastie and Tibshirani 
(1995) give an example of speech dataset consisting of 162 observations - each one is a 
periodogram of a "phoneme" spoken by a person. Ramsay and Silverman (2002) discuss 
other applications. 

• Microarray analysis : Gene microarrays present data in the form expression profiles of 
several thousand genes for each subject under study. Bair, Hastie, Paul and Tibshirani 
(2006) analyze an example involving the study of survival times of 240 (= n) patients 
with diff'use large B-cell lymphoma, with gene expression measurements for 7389 (= N) 
genes. 

Of late, researchers in various fields have been using different versions of non-identity covariance 
matrices of growing dimension. Among these, a particularly interesting model assumes that, 

(*) the eigenvalues of the population covariance matrix S are (in descending order) 

^1, . . . ,^M,cr^, • • • jCr^, 

where £m > > 0. 
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This has been deemed the "spiked population model" by Johnstone (2001). It has also been 
observed that for certain types of data, e.g. in speech recognition Buja, Hastie and Tibshirani 
(1995), wireless communication Telatar (1999), statistical learning (Hoyle and Rattray (2003, 
2004)), a few of the sample eigenvalues have limiting behavior that is different from the be- 
havior when the covariance is the identity. This paper deals with the issue of estimating the 
eigenvectors of S, when it has the structure described by (*), and the dimension N grows to 
infinity together with sample size n. 

In many practical problems, at least the leading eigenvectors are thought to represent some 
underlying phenomena. This has been one of the reasons for their popularity in analysis of what 
can be characterized as functional data. For example, Zhao, Marron and Wells (2004) consider 
the "yeast cell cycle" data of Spellman et al. (1998), and argue that the first two components 
obtained by a functional PCA of the data represent systematic structure. In climate studies, 
empirical orthogonal functions are often used for identifying patterns in the data, as well as for 
data STimmary. Sec for example Corti, Moltcni and Palmer (1999). In many of these instances 
there is some idea about the structure of the eigenvectors of the covariance matrix, such as to 
the extent they are smooth, or oscillatory. At the same time, these data are often corrupted with 
a substantial amount of noise, which can lead to very noisy estimates of the eigen-elements. 
There is also a growing literature on functional response models in which the regressors are 
random functions and the responses are either vectors or functions (Chiou, Miiller and Wang 
(2004), Hall and Horowitz (2004), Cardot, Ferraty and Sarda (2003)). Quite often a functional 
principal component regression is used to solve these problems. Thus, there are both practical 
and scientific interests in devising methods for estimating the eigenvectors and eigenvalues that 
can take advantage of the information about the structure of the population eigenvectors. At 
the same time, there is also a need to address this estimation problem from a broader statistical 
perspective. 

In multivariate analysis, there is a huge body of work on estimation of population covariance, 
and in particular on developing optimal strategies for estimation from a decision theoretic point 
of view. Dey and Srinivasan (1985), Efron and Morris (1976), Half (1980), Loh (1988) are some 
of the standard references in this field. However, a decision theoretic treatment of functional 
data analysis is still somewhat limited in its breadth. Hall and Horowitz (2004) and Tony Cai 
and Hall (2005) derive optimal rates of convergence of estimators of the regression function and 
fitted response in functional linear model context. Cardot (2000) gave upper bounds on the rate 
of convergence of a spline-based estimator of eigenvectors under some smoothness assumptions. 
Kneip (1994) also derived similar results in a slightly different context. 

In this paper, the aim is to address the problem of estimating eigenvectors from a minimax 
risk analysis viewpoint. Henceforth, the observations will be assumed to have a Gaussian 
distribution. This assumption, though somewhat idealized, helps in bringing out some essential 
features of the estimation problem. Since algebraic manipulation of spectral elements of a 
matrix is rather difficult, it is not easy to make any precise finite sample statement about 
the risk properties of estimators. Therefore the analysis is mostly asymptotic in nature, even 
though efforts have been made to make the approximations to risk etc. as explicit as possible. 
The asymptotic regime considered here assumes a triangular array structure in which N, the 
dimensionality of individual observations, tends to oo with sample size n. This framework is 
partly motivated by similar analytical approaches to the problem of estimation of mean function 
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in nonparametric regression context. In particular, a squared error type loss is proposed, and 
some /''-type sparsity constraint is imposed on the parameters, which in our case are individual 
eigenvectors. Relevance of this sort of constraints in the context of functional data analysis is 
discussed in Section [3l The main results of this chapter are the following. Theorem 1 describes 
risk behavior of sample eigenvectors as estimators of their population counterparts. Theorem 
2 gives a lower bound on the minimax risk. An estimation scheme, named Augmented Sparse 
Principal Component Analysis (ASPCA) is proposed and is shown to have the optimal rate 
of convergence over a class of /'^ norm-constrained parameter spaces under suitable regularity 
conditions. Throughout it is assumed that the leading eigenvalues of the population covariance 
matrix are distinct, so the eigenvectors are identifiable. A more general framework, which 
looks at estimating the eigen-subspaces and allows for eigenvalues with arbitrary multiplicity, 
is beyond the scope of this paper. 

2 Model 

Suppose that, {Xi : i = l,...,n}„>i is a triangular array, where the x 1 vectors Aj := 
A",i = l,...,n are i.i.d. on a common probability space for each n. The dimension N is 
assumed to be a function of n and increases without bound as n — )• oo. The observation vectors 
are assumed to be i.i.d. as A^(^, S), where is the mean vector; and T, is the covariance matrix. 
The assumption on S is that, it is a finite rank perturbation of (a multiple of) the identity. In 
other words, 

M 

S = X,d,el + a^I, (1) 

where Ai > A2 > • • • > Xm > 0, and the vectors 9i,...,0m are orthonormal. Notice that 
strict inequality in the order relationship among the Aj^'s implies that the Oi, are identifiable 
up to a sign convention. Notice that with this identifiability condition, 61, is the eigenvector 
corresponding to the z/-th largest eigenvalue, namely, + cr'^, of S. The term "finite rank" 
means that, M will remain fixed for all the asymptotic analysis that follows. This analysis 
involves letting both n and N increase to infinity simultaneously. Therefore, S, the Aj^'s and 
the 6'jy's should be thought of as being dependent on N. 

The observations can be equivalently described in terms of the factor analysis model : 

M 

Xjk = g + VKvuiOuk + crZjk, i = l,...,n, k=l,...,N. (2) 

k=l 

Here, for each n, v^i, Zik are all independently and identically distributed as A(0, 1). M > 1 
is assumed fixed. 

Since the eigenvectors of S are invariant to a scale change in the original observations, for 
simplifying notation, it is assumed that a = 1. Notice that this also means that, Ai, . . . , Aj\/ 
appearing in the results relating to the rates of convergence of various estimators of Oi, should 
be changed to Xi/a, . . . , Xm /cr when ([1]) holds with an arbitrary a > 0. 

Another simplifying assumption is that, = 0. This is because, the main focus of the current 
exposition is on estimating the eigen-structure of S, and the unnormalized sample covariance 
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matrix 

n 

where X is the sample mean, has the same distribution as that of the matrix 

n-l 
i=l 

where Yi are i.i.d. N(0,'E). This means that, for estimation purposes, if the attention is 
restricted to the sample covariance matrix, then from an asymptotic analysis point of view, it 
is enough to assume ^ = 0, and to define the sample covariance matrix as S = -XX-^, where 
X= [Xi : ... 

The following condition, or Basic Assumption will be used frequently, and will be referred 
to as BA. 

BA (12]) and ([I]) hold, with ^ = and cr = 1; iV = N{n) ^ oo as n ^ oo; Ai > . . . > Aa/ > 0. 

For the estimation problem, it may be assumed that, as n,A^ — )• oo, 9^ := 6^ — )■ 9,^ in /^(M), 
though it is not strictly necessary. But this assumption is appropriate if the observation vectors 
are the vectors of first coefficients of some noisy function in L?'{D) (where D is an interval 
in M), when represented in a suitable orthogonal basis for the L?'{D) space. See Section ?? for 
more details. In such cases one can talk about estimating the eigenfunctions of the underlying 
covariance operator, and the term consistency has its usual interpretation. However, even if 0" 
does not converge in l"^, one can still use the term "consistency" of an estimator 0" to mean 
that L{9^,9^) — )• in probability as n — >• oo, where L is an appropriate loss function. 

2.1 Squared error type loss 

The goal is, given data Xi,X2, . . . , Xn, to estimate 9^, for = 1, . . . , M. To assess the perfor- 
mance of any such estimator, a minimax risk analysis approach is proposed. The first task is to 
specify a loss function for this estimation problem. Observe that since the model is invariant 
under separate changes of sign of the 9^, it is necessary to specify a loss function that is also 
invariant under a sign change. We specify the following loss function : 

L(a,b) = L([a],[b]) := 2(1 - |(a, b)|) =|| a - si5n((a, b))b f, (3) 

where a and b are N x 1 vectors with P norm 1; and [a] denotes the equivalence class of a 
under sign change. Note that, L(a,b) can also be written as min{|| a — b |p, || a + b There 
is another useful relationship with a different loss function, denoted by Ls(a, b) := sin'^ ^(a, b), 
for any two x 1 unit vectors a and b. sinZ(-,-) is a metric on the space i.e. the 

unit sphere in M^. Also, Ls{a,h) = sin2Z(a,b) = 1 - |(a,b)|2 = L(a, b)(2 - L(a, b)). Hence, 
if L{a, b) w 0, then these two quantities have approximately the same value. This implies 
that, the asymptotic risk bounds derived in terms of the loss function L remain valid, up to a 
constant factor, for the loss function Lg as well. 
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2.2 Rate of convergence for ordinary PC A 

It is assumed that either Ai is fixed, or that it varies with n and so that, 
LI as n, — oo, ^ — for = 1, . . . , M, where 1 = pi > p2 > ■ ■ ■ > Pm', 



L2 as n, N ^ oo. 



N 
nh(Xi) 



0, where 



hiX) 



A2 



1 + A 



(4) 



N 



Notice that, all four conditions (i)-(iv) below imply that ^^^-^^^^ 
(i) f ^7G(0,oo)and;^^0 





as n — 7- oo. 



(ii) Ai ^ 0, ^ ^ and , 



N 



(iii) < lim inf„_j>oo Ai < lim sup„_i.oQ Ai < oo and 



TV 







(iv) ^ 



OO, and 



0. 



Remark : Condition LI is really an asymptotic identifiability condition which guarantees that 
at the scale of the largest "signal" eigenvalue, bigger eigenvalues are well-separated. 

Theorem 1: Suppose that the eigenvalues Ai,...,Am satisfy LI and L2. Iflog{n\/N) = 
o{n A A^), then for v = I, . . . ,M , 



sup EL{e^,e^ 



N-M ^ ly. (A^ + 1)(A^ + 1) 



(l + o(l)). 



(5) 



Remark : It is possible to relax some of the conditions stated in the theorem. On the other 
hand, with some reasonable assumptions on the decay of the eigenvalues, it is also possible to 
incorporate cases where M is no longer a constant, but increases with n. Then the issues would 
include, rates of growth of M and the rate of decay of eigenvalues that would result in the OPCA 
estimator retaining consistency and the expression for its asymptotic risk. These issues are not 
going to be addressed here. However, it is important to note that, such questions have been 
investigated - not necessarily for the Gaussian case - in the context of spectral decomposition 
of stochastic processes by. Hall and Horowitz (2004), Tony Cai and Hall (2005), Boente 
and Fraiman (2000), Hall and Hosseini-Nasab (2006) among others. However, these analyses 
do not deal with measurement errors. The condition .^^'^ — t- is a necessary condition for 
uniform convergence, as shown in Theorem 2. It should be noted that, there are results, proved 
under slightly different circumstances, that obtain the rates given by ([5]) as an upper bound 
on the rate of convergence of OPCA estimators (Bai (2003), Cardot (2000), Kneip (1994)). 
These analyses, while treating the problem under less restrictive assumptions than Gaussianity 

(essentially, finite eighth moment for the noise Zi^), make the assumption that > 0, when 

the Ajy's are considered fixed. 



6 



3 Sparse model for eigenvectors 



In this section we discuss the concept of sparsity of the eigenvectors and impose some restrictions 
on the space of eigenvectors that lead to a sparse parametrization. This notion will be used 
later from a decision-theoretic view point in order to analyze the risk behavior of estimators of 
the eigenvectors. Prom now on, will be used to denote the matrix [^i, . . . , 0m\- 

3.1 l'^ constraint on the parameters 

The parameter space is taken to be a class of M-dimensional positive semi-definite matrices 
satisfying the following criteria: 

• Ai > . . . > Am- 

• For each i/ = 1, . . . , M, di, G B,^ for some Qi, C S^~^ that gives a sparse parametrization, 
in that most of the coefficients O^i^ are close to zero. 

• ^1 , • • • 5 are orthonormal. 

One way to formalize the requirement of sparsity is to demand, as in Johnstone and Lu (2004), 
that 9y belongs to a weak-i'^ space wl'^{C) where C,q > 0. This space is defined as follows. 
Suppose that the coordinates of a vector x G are . . . , |a;|(Ar), where |a;|(fe) is the k-th 

largest element, in absolute value. Then 

xGwZ«(C) ^ |a;|(fc) < Cfc-i/«, A; = 1,2,.... (6) 

In the Functional Data Analysis context, one can think of the observations as the vectors of 
wavelet coefficients (when transformed in an orthogonal wavelet basis of sufficient regularity) 
of the observed functions. If the smoothness of a function g is measured by its membership in 
a Besov space B^, ^, and if the vector of its wavelet coefficients, when expanded in a sufficiently 
regular wavelet basis, is denoted by g, then from Donoho (1993), 

g e B^,^, ^ g G wli, q = if a > {1/q' - 1/2)+. 

One may refer to Johnstone (2002) for more details. Treating this as a motivation, instead of 
imposing a weak-Z^ constraint on the parameter 0^, we rather impose an constraint. Note 
that, for C,q> 0, 

N 

xGR^nZ''(C) ^ 5^kfc|«<C«. (7) 

k=l 

Since l'^{C) ^ wl'^{C), it is possible to derive lower bounds on the minimax risk of estimators 
when the parameter lies in a wl'^ space by restricting attention to an Z* ball of appropriate 
radius. 

For C> 0, define @q{C) by 

N 

e,(C) = {aGS^-i: J]|afer<Cn, (8) 

k=l 
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where S ^ is the unit sphere in M centered at 0. One important fact is, if < g < 2, for ©q(C) 
to be nonempty, one needs C > 1, while for q> 2, the reverse inequahty is necessary. Further, 
for < g < 2, if C"? > iVi-'?/^ then the space Qq{C) reduces to §^ ^ because in this case, the 
vector {1/y/N , . . . , is in the parameter space. Also, the only vectors that belong 

in the space when C = 1 are the poles, i.e. vectors of the form (0, 0, ... , 0, ±1,0, . . . , 0), where 
the non-zero term appears in exactly one coordinate. Define, for q G (0, 2), mc to be an integer 
> 1 that satisfies 

< ^ ^ 1)1-9/2. (9) 

Then mc is the largest dimension of a unit sphere, centered at 0, that fits inside the parameter 
space @q{C). 



3.2 Parameter space 

The parameter space for 9 := [6i : . . . : 6m] is denoted by 

M 

{Ci,..., Cm) = {6 eHOqiC) : {e,,e,>)=0, for u^u'}, (10) 
i/=i 

where Qq{C) is defined through ([8]), and C^, > 1 for all = 1, . . . , M. 

Remark : If M > 1, one can describe the sparsity of the eigenvectors in a different way. 

Consider the sequence C = {\l^^=i ^v^tk • ^ ~ 1,2,...,A^). One may demand that 

the vector ^ be sparse in an Z"? or weak-/'' sense. This particular approach to sparsity has some 
natural interpretability, since the quantity C| = S^li ^v^tk-- where Cfc is the A;-th coordinate 
of C, is the variance of the fc-th coordinate of the "signal" part of the vector X. There is a 
connection between this model and the model we intend to study. If (|10p holds, then C, G Z^(Ca), 

where C\ = ^^l^ y?J'^Cl. On the other hand, l'^ (weak-Z'^) sparsity of (" implies l'^ (weak-/'') 
sparsity of di, for all = 1, . . . , M. 



3.3 Lower bound on the minimax risk 

In this section a lower bound on the minimax risk of estimating Qy over the parameter space 
([To]) is derived when < g < 2, under the loss function defined through ([3]). The result is 
stated under some simplifying assumptions that make the asymptotic analysis more transparent. 
Define 

Al There exists a constant Co > such that Cg < Cfi — 1 for all // = 1, . . . , M, for all N . 
A2 As n,N ^ oo, nh{X^) — )• oo. 
A3 As n, ^ oo, n/i(A^) = 0(1). 

A4 As n, — )• oo, ng{X^, X,^) — )• oo for all /i = 1, . . . , — 1, + 1, . . . , M. 
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A5 As n,N ^ oo, nmaxi<^^;,<Af g(A^, Aj,) = 0(1). 

Conditions A4 and A5 are applicable only when M > 1. In the statement of the following 
theorem, the infimum is taken over ah estimators 9i^, estimating 9^, satisfying || 9i, \\= 1. 

Theorem 2: Let < q < 2 and 1 < ly < M. Suppose that Al holds. 

(a) If A3 holds, then there exists Bi > such that 

liminfinf sup EL{9u,9u) > Bi. (12) 
eeef (Ci,...,Cm) 

(b) // A2 holds, then there exists B2 > 0, Ag > 0, and c\ G (0, 1), such that 

Uminf(5^Mnf sup EL{9^,9^) > B2, (13) 



where 6n is defined by 
ci 



Sn = < 



% eeef (Ci,...,Cm) 



if nh{K) <m\n{ci{N - M),AqCl{nh{K))'i''^} 



ci{N - M) <mm{nh{K),AgCl{nh{X,)Y'^] (14) 



and 



^- = ^'-^^)) inhiX.))i-./2 ' ^^^-^^(fo^)'^ ^"^^"^1^'^^ 

(15) 

for some K > 0, a £ (0,1), Cg{a) £ (0, 1) and Ag,a > 0. Here Cl := Cl — 1. Also, one 
can take ci = log(9/8), Ag = {^f~^l'^, Ag^^ = (a/2)i-«/2, Cg{a) = (a/9)i-'?/2, ^2 = 1 
and B3 = (8e)-^ 

(c) Suppose that M > 1. // A4 holds, then there exists B3 > such that 

liminf^^^inf sup EL{9^„9^) > B3, (16) 
eeef (Ci,...,Cm) 

where ^ ^ 

5„ = - max — - — — ■ . (17) 
n ^,e{l,...,M}\{'^} 9{X^,Xy) 

One can take B^ = However, if A5 holds, then is true. 

Remark : In the statement of Theorem 2, there is much flexibihty in terms of what values the 
"hyperparameters" Ci, . . . ,Cm and the eigenvalues Ai, . . . , Am can take. In particular, they can 
vary with N, subject to the modest requirement that Al is satisfied. However, the constants 
appearing in equations (fTSll and (fT6ll are not optimal. 
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Remark : Another notable aspect is that, as the proof later shows, the rate lower bounds in 
Part (b) are all of the form jj^jj-^, where m is the "effective" number of "significant" coordinates. 
This phrase becomes clear if one notices further that, in the construction that leads to the lower 
bound (see Section 16. 7p , the vector O^, in a near- worst case scenario has overwhelming number of 
coordinates of size const. , ^ , or, in the case (1151 ). of size const. . Here m is of the 

same order as the number of these "significant" coordinates. This suggests that, an estimation 
strategy that is able to extract coordinates of of the stated size, would have the right rate 
of convergence, subject to possibly some regularity conditions. The estimator described later 
(ASPCA) is constructed by following this principle. 

Part (a) and the second statement of Part (c) of Theorem 2 depict situations under which 
there is no estimator that is asymptotically uniformly consistent over 0^(Ci, . . . , Cm)- More- 
over, the first part of Part (b), and Theorem 1 readily yield the following corollary. 

Corollary 1: If the conditions of Theorem 1 hold, and if Al holds, together with the condition 
that 

then the usual PCA-based estimator of By, i.e. the eigenvector corresponding to the u-th largest 
eigenvalue ofS, has asymptotically the best rate of convergence. 

Remark : A closer look at the proof of Theorem 1 reveals that the method of proof explicitly 
made use of condition LI to ensure that the contribution of Ai, . . . , Aa/ to the residual term of 
the second order expansion of 9i, is bounded. However, the condition n max^^,y 9{^^l, ^u) — )• 00 is 
certainly much weaker than that. The method of proof pursued here fails to settle the question 
as to whether this is sufficient to get the asymptotic rate ([5]). It is conjectured that this is the 
case. 

4 Estimation scheme 

This section outlines an estimation strategy for the eigenvectors 9^, v = 1, . . . , M. Model ([2]) 
is assumed throughtout for observations Xj, i = 1, . . . , n. We propose estimators is for the case 
when the noise variance cr^ is known. Therefore, without loss of generality, it can be taken to 
be 1. Henceforth, for simplicity of notations, it is also assumed that ^ = 0. In practice, one may 
have to estimate cj^ from data. The median of the diagonal entries of the sample covariance 
matrix S := ^XX"^ serves as a reasonable (although slightly biased) estimator of cr^, if the 
true model is sparse. In the latter case, the data are rescaled by multiplying each observation 
by and the resultant covariance matrix is called, with a slight abuse of notation, S. Note 
that, in this case, the estimates of eigenvalues of S are a"^ times the corresponding eigenvalues 
of S. 

4.1 Sparse Principal Components Analysis (SPCA) 

In order to motivate the approach that is described in what follows, consider first the SPCA 
estimation scheme studied by Johnstone and Lu (2004). To that end, let S = ^yiX^ denote 
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the sample covariance matrix. Suppose that the sample variances of coordinates (i.e., diagonal 
terms of S) are denoted by (jf , . . . , a^. 

• Define In to be the set of indices k G {!,••• ,N} such that (t| > 7„ for some threshold 
7n > 0. 

• Let Sf f be the submatrix of S corresponding to the coordinates /„. Perform an eigen- 
analysis of St f . Denote the eigenvectors by ei, . . . , e . r ,7 i, . 

• For v = 1, . . . , M, estimate O^, by e^/ where e^, an A'' x 1 vector, is obtained from e^/ by 
augmenting zeros to all the coordinates that are in {!,..., N} \ /„. 

Johnstone and Lu (2004) showed that, if one chooses an appropriate threshold 7„, then the 
estimate of 9i, is consistent under the weak-Z^ sparsity constraint on Oi,. However, Paul and 
Johnstone (2004) showed that even with the best choice of 7„, the rate of convergence of the risk 
of this estimate is not optimal. Indeed, Paul and Johnstone (2004) demonstrate an estimator 
which has a better rate of convergence in the single component (M = 1) situation. 



4.2 Augmented Sparse PCA (ASPCA) 

We now propose the ASPCA estimation scheme. This scheme is a refinement of the SPCA 
scheme of Johnstone and Lu (2004), and can be viewed as a generalization of the estimation 
scheme proposed by Paul and Johnstone (2004) in the single component (M = 1) case. 

The key idea behind this estimation scheme is that, in addition to using the coordinates 
having large variance, if one also uses the covariance structure appropriately, then under the 
assumption of a sparse structure of the eigenvectors, one will be able to extract a lot more 
information and thereby get more accurate estimate of the eigenvalues and eigenvectors. Notice 
that SPCA only focuses on the diagonal of the covariance matrix and therefore ignores the 
covariance structure. This renders this scheme suboptimal from an asymptotic minimax risk 
analysis point of view. To make this point clearer, it is instructive to analyze the covariance 
matrix in the M = 1 case. In view of the second Remark after the statement of Theorem 
2 one expects to be able to recover coordinates k for which \Oik\ 2> —7==- However, the 

■s/nh{Xi) 

best choice for 7^ for SPCA is for some constant 7 > 0, which is way too large. On 

the other hand, suppose that one divides the coordinates into two sets A and B, where the 
former contains all those k such that \9k\ is "large", and the latter contains smaller coordinates. 
Partition the matrix S as 

J, _ ^AA "^AB 
l^BA ^BB_ 

Here T,ba = M(^i,b(^ia- Assume that, there is a "preliminary" estimator of 9i, say 9i such 
that, {9i^j{, 9i ji) — > 1 in probability as n — t- 00. Then one can use this estimator as a "filter", in 
a way described below, to recover the "informative ones" among the smaller coordinates. This 
can be seen from the following relationship 

'^ba9i,a = {9i,a,9i,a)^i9i,b ~ Ai6'i,b. 
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In this manner one can extract some information about those coordinates of 9i that are in set 
B. The algorithm described below is a generalization of this idea. It has three stages. First 
two stages will be referred to as "coordinate selection" stages. The final stage consists of an 
eigen- analysis of the submatrix of S corresponding to the selected coordinates, followed by a 
hard thresholding of the estimated eigenvectors. 

Let 7i > for i = 1,2,3 and k > be four constants to be specified later. Define 7i^„ = 

71 y^^^. 

1° Select coordinates k such that a^k ■= ^kk > l+7i,n- Denote the set of selected coordinates 
by h,n- 

2° Perform spectral decomposition of Sr f . Denote the eigenvalues hy £i > . . . > £mi 
where nii = min{n, |/i,n|}i and corresponding eigenvectors by ei, . . . ,6^^. 

3° Estimate M by M defined in Section SSI Estimate Xj hy Xj = ij - I, j = 1, . . . , M. 
4° Define E = [^ei : . . . : ^er^l. Compute Q = S^, f E. 

5° Denote the diagonal of the matrix QQ"^ by T. Define /2,n to be the set of coordinates 
/c G {1, . . . , N} \ Ii^n such that \Tk\ > 72 n where 




/ log(n V iV) 1 

72,n =12\ \ + - „ 

n K \ n 



6° Take the union /„ := Ii ni] I2 n- Perform spectral decomposition of Sr r . Estimate 
6y by augmenting the z^-th eigenvector, with zeros in the coordinates {1, . . . , A^} \ /„, for 
u = I, . . . , M. Call this vector 6^- 

7° Perform a coordinatewise "hard" thresholding of 9i, at threshold 



73,n := 731 



/log(n V A^) 



nh{X„) 

and then normalize the thresholded vectors to get the final estimate 9,^. 

Remark : The scheme is specified except for the "tuning parameters" 71,72,73 and k. The 
choice of 7i's is discussed in the context of deriving upper bounds on the risk of the estimator. 

It will be shown that, it suffices to take 71 = 4, k = 2 + e for a small e > 0, and 72 = yj^^^- An 
analysis of the thresholding scheme is not done here, but in practice 73 = 3 works well enough, 
and some calculations suggest that 73 = 2 suffices asymptotically. 
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4.3 Estimation of M 



Let 7i,7x > be such that 7i > 7i- Define 



- _ _ _ / log(^^ V N) 
\^ri = {/c : Sfcfc > 1 + 7i,n} where 7i^„ = 7iW , (18) 



n 



= {fc : S,fc > 1 + 7^} where 7^ = 7^^ ^"^^""^ • (19) 



Define 




a„ = 2\/^^ + ^ + 6 f ^ VI 
n \ n 



iog(nv|i;^„|) 
^ ny\iln\ 



(20) 



Let ^1 > . . . > ^mj, where mi = min{n, |/i „,|}, be the nonzero eigenvalues of S- - . Define 
M by 

M = max{l < A; < mi : 4 > 1 + a„}. (21) 
The choice of 7^ and 7^ is discussed in Section I8.5i 



Remark : Sparsity of the eigenvectors is an imphcit assumption for ASPCA scheme. However, 
in practice, and specificahy with only moderately large samples, it is not always the case that 
ASPCA is able to select the significant coordinates. More importantly, the scheme produces a 
bona fide estimator only when is non-empty. If this is not the case, then one may use the 
u-th eigenvector of S as the estimator of 61^. However, determination of M in this situation is 
a difficult issue, and without recourse to additional information, one may set M = 0. 



5 Rates of convergence 

In this section we describe the asymptotic risk of ASPCA estimators under some regularity 
conditions. The risk is analyzed under the loss function and it is assumed that condition 
BA of Section [2] holds. Further, the parameter space iov 9 = [61 : . . . : 6m], over which the 
risk is maximized, is taken to be Q^{Ci, . . . , Cm) defined through (fTO]) in Section 13. 2^ where 
< g < 2 and Ci, . . . , Cm > 1- 

5.1 Sufficient conditions for convergence 

The following conditions are imposed on the "hyperparameters" of the parameter space 0q(Ci, . . . 
Suppose that pi, . . . , pM are as in CI given below. Define 

M 

p,(C):=Y^ptCl. (22) 

Observe that, since C^, > 1 for all = 1, . . . , M, pq(C) > Y^^L^ pl'"^ > 1. 
CI Ai, . . . , Am are such that, as n — )• 00, y' ^ where 1 = pi> p2> ■ ■ ■ > Pm- 



13 



C2 log N X log n and ^^^^^ )• as n — )• oo. 



C3 ^ as n ^ oo. 

We discuss briefly the importance of these conditions. CI is a repetition of LI. C2 is a 
convenient and very mild technical assumption that should hold in most practical situations. 
Second part of C2 is non-trivial only when Ai — )• as n — )• oo. C3 requires some explanation. 
It will become increasingly clear that, in order to get a uniformly consistent estimate of the 
eigenvectors from the preliminary SPCA step, one needs C3 to hold. Indeed, the sequence 
described in C3 has the same asymptotic order as a common upper bound for the rate of 
convergence of the supremum risk of the SPCA estimators of all the 9u^s. So, the implication is 
that if C3 holds then the SPCA scheme of Johnstone and Lu (2004) gives consistent estimates. 

Remark : Note that, ^^j^ < ^ if A G (0, c) and ^^j^ < if A > c, for any c > 0. Since 

Pq{C) > 1, C3 guarantees that 

p,(C)(logiV)i-g/2 

inHX,)y-'^/^ asn^oo. (23) 

In fact, if liminfn^oo Ai > c > 0, then the upper bound in can be 

replaced by o((M)i/2-g/4), 

It will be shown that this is a common (and near-optimal) upper bound on the rate of con- 
vergence of the ASPCA estimate of O^^s. If one compares this with the lower bound given by 
Theorem 2, it is conjectured that (j23|) should also be a sufficient condition for establishing that 
the lower bound defined through (llSp is also the upper bound on the minimax risk, at the level 
of rates. However, since our method depends on finding a preliminary consistent estimator of 
the eigenvectors (in our case SPCA), the somewhat stronger condition C3 becomes necessary 
to establish rates of convergence of the ASPCA estimator. 

5.2 Statement of the result 

Now we state the main result of this section. The asymptotic analysis of risk is conducted only 
for the estimator B^, for eigenvector 9i,, and not for the thresholding estimator 9,^. Derivation of 
the results for 9i, requires additional technical work, but can be carried out. It can be shown that 
in certain circumstances the latter has a slightly better asymptotic risk property. In practice, 
the thresholding estimator seems to work better when the eigenvalues are well-separated. The 
following theorem describes the asymptotic behavior of the risk of the ASPCA estimator 9^ 
under the loss function L defined through ([3]). g{-,-) is defined by (jlip . 



Theorem 3: Assume that BA and conditions C1-C3 hold. Then, there are constants K :- 
K{q, 7i, 72, k) and K' := K'{q, M, 71, 72, k) such that, as n ^ 00, for all i' = 1, . . . , M, 



< 



sup EL{9^,e^) 



log(nViV) V nh{Xu) J ^ng{X^,X^) 



(l + o(l)) (24) 
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Remark : The expression in the upper bound is somewhat cumbersome, but the significance 
of each of the terms in (|24p will become clear in the course of the proof. However, notice that, 
if the parameters Ci, . . . , Cm of the space 0q(M)(Ci, . . . ,Cm) are such that, 

30 <C<C <oo, such that C < ™^^i</^<a^^m < ^^1 n, (25) 

mmi<^<A/ 

then, Theorem 3 and Theorem 2 together imply that, under conditions BA, C1-C3, Al and 
the condition on the hyperparameters given by (jl5p . the ASPCA estimator 6i, has the optimal 
rate of convergence. The condition (p5|) is satisfied in particular if Ci, . . . , Ca/ are all bounded 
above. 

It is important to emphasize that (|24|) is an asymptotic result in the following sense. It 
is possible to give finite a sample bound on supg^QM^^Q^^ Q^^^ ^LiOy^Oy). However, this upper 
bound involves many additional terms whose total contribution is smaller than a prescribed 
e > only when n > n^, say, where depends on the hyperparameters, apart from e. 

Remark : It is instructive to compare the asymptotic supremum risk of ASPCA with that of 
OPCA (or usual PCA based) estimator of 6y. A closer inspection of the proof reveals that, if 
for all sufficiently large n. 



dog(nVA)^V nh{Xy) 

then for some constant K" > 0, under BA and C1-C3, one can replace the upper bound in 
m by 

Alog(nVA) ^-^ 1 



K 



(1 + 0(1)), 



for some constant K. This rate is greater than that of OPCA estimator by a factor of at most 
log(n V N\ However, observe that, the bound on the risk of OPCA estimator holds under 
weaker conditions. In particular, Theorem 1 does not assume any particular structure for the 
eigenvectors. 



6 Proof of Theorem 2 

The proof requires a closer look at the geometry of the parameter space, in order to obtain 
good finite dimensional subproblems that can then be used as inputs to the general machinery, 
to come up with the final expressions. 



6.1 Risk bounding strategy 

A key tool for our proof the lower bound on the minimax risk is Fano's lemma. Thus, it 
is necessary to derive a general expression for the Kullback-Leibler discrepancy between the 
probability distributions described by two separate parameter values. 
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Proposition 1: Let 6^^"^ = [9^'' : ... : 0^^^], j = 1,2 he two parameters. Let '^[j) denote the 
matrix given by with 6 = 9^^^ (and a = 1). Let Pj denote the joint probability distribution 
of n i.i.d. observations from A^(0,S(j)). Then the Kullhack-Leihler discrepancy of P2 from Pi, 
to be denoted by Ki^2 '■= K{9^-^\9^'^^), is given by 



K 



1,2 



def 



K{9^^\9^^^)=n 



M MM 



u=l u'=l 



(26) 



where 



77(A) 



1 + A' 



A > 0. 



(27) 



6.2 Use of Fano's lemma 

We outline the general approach pursued in the rest of this section. The idea is to bound the 
supremum of the risk on the entire parameter space by the maximum risk over a finite subset 
of it, and then to use some variant of Fano's lemma to provide a lower bound for the latter 
quantity. 

Thus, the goal is to find an appropriate finite subset J^q of Q^^{Ci, . . . , Cm), such that the 
following properties hold. 

(1) If 6l(i),6'(2) G Jc-Q, then L(6li,^\#) > 46, for some 5 > (to be chosen). This property 
will be referred to as "4(5-distinguishability in 9i," . 

(2) The element 9 £ J^q is a unique representative of the equivalence class [9], where [9] is 
defined to be the class of x M matrices whose z^-th column is either 9i, or —9^. 

(3) Subject to (1), the quantity sup^.^^-. if(6lW, 6l(j)) + K{9^^\9^'^) is as small as 
possible. 

Given any estimator 9 of 9, based on data X„ = (Xi, . . . define a new estimator (/>(X„) 

(an N X M matrix) as <^(X„) = 9* if 9* = argminegjr^j L{9,^,9u), where 9i, is the v-th column 
of 9 (i.e., estimate of 9,^). Then, by Chebyshev's inequality, 

sup EeL{9^X) > S sup mL{9^X) > 
6»eef (Ci,...,Cm) 0eef(Ci,...,CM) 

> 6supFe{L{9^X)>S) 

> 6 supFemXn)]^[9]). (28) 

The last inequality is because, ifL(#,?^) < 6 for any 9^^^ G J"o, then by the "4(5-distinguishability 
in 9^'' (property (1) above), it follows that [(puO^n)] = [0u^], and hence [(/)(X„)] = [9^^^. 

Two versions of Fano's lemma are found to be useful in this context. The following version, 
due to Birge (2001), of a result of Yang and Barron (1999) (p. 1570-71), is most suitable when 
To can be chosen to be large. 
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Lemma 1: Let {Pq : 6 G &} be a family of probability distributions on a common measurable 
space, where Q is an arbitrary parameter space. Suppose that a loss function for the estimation 
problem is given by L'{9,6') = le^e'- Define the minimax risk over by 

Pmax = inf supP0(T ^ 0),= inf supEL'(0,r), 

where T denotes an arbitrary estimator of 9 with values in B. Then for any finite subset T of 
0, with elements 6i, . . . ,6j where J = 

Q log J 

where Pi = ¥g., and Q is an arbitrary probability distribution, and K{Pi,Q) is the Kullback- 
Leibler divergence of Q from Pi . 



To use Lemma 1 choose Pi to be -Ps(i) = PqH) '■= -^"^"(O, S(j-)), where is the ma- 
trix Ylit=i ^u(^u^Gu^ + I, and 6'(*) G Tq i = 1, . . . , \To\, are the distinct values of parameter 
6 that constitute the set Tq. Then set Qq = Pg(o), for some appropriately chosen 9^'^'^ G 
0^(Ci, . . . ,Cm) such that the following condition is satisfied. 

at;ei<i<|^„|i^(0«,^W)x sup (30) 

i<i<|Jbl 

where the notation "x" means that the both sides are are within constant multiples of each 
other. Then it follows from ([28]) and Lemma 1 that, 

6 sup EeL{d,y,e„) > 1 ■ — - — — . (31) 

ee0M(Ci,...,CM) log|>o| 

To complete the picture it is desirable that 

a?;ei<,<|^„|K(0«,e(o)) + log2 



log ITti 



(32) 



where c is a number between and 1. 

A different version of Fano's lemma, due to Birge (2001), is needed when J^q consists of only 
two elements 6^^^ and 9^'^\ so that the classification problem reduces to a test of hypothesis of 
Pi against P2- 

Lemma 2: Let ax and /3t denote respectively the Type I and Type II errors associated with an 
arbitrary test T between the two simple hypotheses Pi and P2. Define, iimis = inf 7^(07^ + (3t), 
where the infimum is taken over all test procedures. 

K{Pl,P2) > - log[7r^i,(2 - TTrms)]- (33) 
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6.3 Geometry of the parameter space 

We view the space @q{C), for < q < 2, as the A^-dimensional unit sphere centered at the 
origin, from which some parts have been chopped off, symmetrically in each coordinate, such 
that there is some portion left at each pole (i.e., a point of the form (0, . . . ,0, ±1,0, ... ,0), 
where the non-zero term appears only once). In this connection, we define an object that is 
central to the proof of Theorem I3.3[ 

Definition : Let < r < 1 and N > m > 1. An {N,m,r) polar sphere at pole ko, on set 
J = {ji, . . . ,jm}, where 1 < ko < N and ji G {1, . . . , A^} \ {ko} for I = 1, . . . ,m, is a subset of 
S^^^ given by 

m 

S{N, m, r, ko, J) := {x G S^'^ : Xk, = y/l-r^ ^ 4 = r^}. (34) 

1=1 

So, an {N,m,r) polar sphere is centered at the point (0, . . . , 0, Vl — 0, . . . , 0), (which is 
not in §^~^), has radius r, and has dimension m. Note that, the largest sphere of any given 
dimension m, such that < (equivalently, m > mc, where mc is defined through 

([9])), that can be inscribed inside 6g(C) is an {N,m,r) polar sphere. The radius r of such a 
polar sphere, given C < m^""?/^ (or m > mc), to be denoted by rm(C), satisfies 

{1 - (r„(C))2}'?/2 + m'~'^/^r^{C)r = (35) 

Of course, if C > m^~'^/^ (or mc > m ) then as a convention, rm{C) = 1. Condition (I35p 
ensures that all the points lying on an {N,m,r) polar sphere such that r G (0,rm(C)), are 
inside 0g(C). 

6.4 A common recipe for Part (a) and Part (b) 

In the proof of Part (a) and Part (b) of the theorem, there is a common theme in the construction 
of J-Q. Let e^ denote the A'^- vector whose //-th coordinate is 1 and rest are all zero. In either 
case, if {0'^^\j = 1, • • • , |-^o|} is an enumeration of the elements of J-q, then the following are 
true. 

(Fl) There is an TV x M matrix e^^\ such that of^ = e^. 

(F2) 0^^'^ = e^, for ^ = 1, . . . , - 1, + 1, . . . , M, for all j = 0, 1, . . . , \Fo\. 

(F3) Ov^ G S{N, m, r, u, J) for some m, r and J. m and r are fixed for all 1 < j < but J 
may be different for different j, depending on the situation. 

The 6'(o) in (Fl) is the same 

0(0) 

appearing in (|3ip . Also, (|26p simplifies to 
^(^O), 0(0)) = in/i(A.)(l - {{ei^\e^^^)f) = \nh{Ky, j = l,..., \Fo\. (36) 

Moreover, in either case, the points 9^^^ are so chosen that 

L{ei^\e'^^^) > r^, for ah l<j^k< \Fo\. (37) 
In other words, the set J-q is r"^ distinguishable in 9^. 
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6.5 Proof of Part (a) 

Construct Fq satisfying (F1)-(F3), with 



^i) = Vl-r2e^ + rej, j = M + 1, . . . , TV, 

where r E (0, 1) is such that (1 -r^)?/^ + < Cl. Thus, |J"o| = N-M. Verify that ([37]) holds, 
in fact the lower bound is 2r'^, with an equality. Therefore, (j3ip applies, with (5 = Since 
nh{\u) is bounded above, and log(A^ — M) — )• oo as n — )• oo, (fT2]) follows from (p6]l . 

6.6 Connection to "Sphere packing" 

Our proof of Part (b) of Theorem 2 depends crucially on the following construction due to Zong 
(1999). 

Let m be a large positive integer, and uiq = [^] (the largest integer < ^). Define as 
the maximal set of points of the form z = (zi, . . . , z^) in S"^"^ such that the following is true. 

m 

^niQZi £ { — 1,0, 1} V i, = -y/^tQ and, for z,z'Gy^, ||z — z'||>l. (38) 

i=l 

For any m > 1, the maximal number of points lying on such that any two points are at 

distance at least 1, is exactly same as the kissing number of an m-sphere. It is known that this 
number is < 3™ and > (9/8)'"(^+°(^)) . Zong (1999) uses the construction described above to 
derive the lower bound, by showing that \Yj^\ > (9/8)™(^+°(-'^)) for m large. 

6.7 Proof of Part (b) 

Structures of J-q for the three cases in ()14p are similar. Set m < (N — M), large. Set ci = 
log(9/8), Aq = (9ci/2)^-'?/2. Choose r « ^/6^, and define the set Jb satisfying (F1)-(F3) and 
the following construction. 

Set \To\ = \Y^\, where Y^ is the set defined in Section WM Set, 

m 

B^J) = Vl_r2e, + r^zPei+M, j = 1, . . . , | Jo|, (39) 
1=1 

where z^-'-' = {z[^\ . . . , Zm'^), j > 1, is an enumeration of the elements of Y^. Observe that, for 
all j > 1, 

ei^'^ G S{N, m,r,u,{M + 1,...,M + m})f] S{N, mo, r, u, supp{z^^'^)), (40) 

where supp{z^^^) is the set of nonzero coordinates of z^^\ Therefore, ()37p and (j36p hold for all 
J > 1- 
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6.7.1 Case : nh{X^) < min{ci(A^ - M), y4gC^(n/i(A^))'?/2} 
Take m = [nh{Xp)] and = ci. Observe that, for all j > 1, 

II 00) ||^= (1 _ ^2)g/2 ^ ^i-5/2^g < ^ (2/9)i-''/2(n/i(A,))i-^/2cf ' < 1 + ciC;^ < C^. 

Thus, -Fo C 0^(Ci, . . . , Cm). Further, since n/i(A,y) oo, log |-Fo| > cin/i(Aiy)(l + o(l)). Since 
([37|) and p6]) hold, with = from (f3T]) the result follows, because 

az;ei<,<|^„|i^(0O),0(o))+iog2 icin/i(A,) + log 2 1 
lim sup j j < lim sup 77T~{ ~ ~ • 

n^oo log|-/-o| n^oo Cinh[\y) 2 

6.7.2 Case : ci{N - M) < min{n/i(A^), AgC^(n/i(A^))'?/2} 
Take m = N — M and = -^^^j-y^- Then, for all j > 1, 

II 0i'^ ||^< 1 + {2/9)^-'^/\N - M)cf\nh{X,))-'i/^ <l + cl = C^. 

The result follows by arguments similar to those used for the case nh{\y) < min{ci(Af — 
M),AgCl{nh{X,))''/^}. 

6.7.3 Case : AgCl{nh{\^))i/'^ < mm{nh{\^),ci{N - M)} 

Take m = ''/^(9/2)i-9/2C^(n/i(A^))'?/2] and = ci^^. Again, verify that m ^ oo as 
n — >• oo (by Al), and for j > 1, 

II 00) ||.< 1 + i2/9)'-'^/'m'-'^/'cf{-^y/^ <l + Cl = CI, 
and the result follows by familiar arguments. 

6.7.4 Proof of (fT5]> 

The construction in all three previous cases assumes that the set of non-zero coordinates is held 
fixed (in our case {M + 1, . . . , M + m}) for every fixed m. However, it is possible to get a bigger 
set To satisfying the requirements, if this condition is relaxed. 

Suppose that Ag^a = («/2)^~'^''^, and the condition in (fT5]) holds for some a G (0,1). 
Set m = [(a/9)-9/2 (9/2)1-9/2^9 ^n/i(A^))«/2 (log Ar)-9/2] ^^^^ ^2 ^ (^/9)_m__ rp^j^g ^ 

(a/9)^"9/2. Observe that m^ooasn^oo,m = 0(A^^"°) and r G (0, 1). Set 9^^'' = [ei : 
. . . : bm]- For every set vr C {M + 1, . . . , N} of size m, construct J> satisfying (F1)-(F3) such 
that, 

0O) = yr^e, + rj;zPe;, j = 1, . . . , |y^|. (41) 

lew 
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As before, C @g\Ci, . . . , Cm), for all vr, so that ([36]) and ([37]) are satisfied. Let "P to be a 
collection of such sets vr such that, for any two sets tt and tt' in "P, the set vr n vr' has cardinality 



at most This ensures that 



for y,y'G U-^-' My,y')> 



This also ensures that the sets are disjoint for vr 7^ vr', since each 6^^ for ^^-'^ G J^q is nonzero 
in exactly niQ + 1 coordinates. Define = IJ^g^ J^. Then 

l-^ol = I U -^^1 = 1^1 \Y^\ > |P|(9/8r(i+°W). (42) 

By Lemma 7, stated in Section[931 there is a collection V such that |P| is at least exp([A^<f (m/9A^)- 
2m£{l/9)]{l + 0(1))), where <f^(x) is the Shannon entropy function : 

£{x) = —X log(x) — (1 — x) log(l — x), < X < 1. 

Since £{x) ~ — xlogx when x — >• 0+, it follows from that, 

^^^jp^ > (log N - log m) - 2^(1/9) + log 9 + log(9/8)] (1 + o(l)) > | log 7V(1 + o(l)), 

since m = 0{N^~'^). Finally, observe that 

«^e,o)g|^,i^(0(j),0(o))^log2 l(a/9)mlogiV 1 

lim sup j j < lim sup — — -- — = - 

n->oo log|-Ao| n->oo (a/9)mlogA' 2 

and use pip to finish argument. 



6.8 Proof of Part (c) 

Consider first the proof of (jl6p . Fix a /x € {1, . . . ,M} \ {u}. Define 9^^^ and ^'■^^ as follows. 
Set = ng(\i,\2) (assume w.l.o.g. that r < 1 A Co). Take O^J} = e^/, j = 1,2 for all n' / iJ., v. 
Define 

eW = e,, 0(2) = Vl - r2e, + re^, 0« = e^, 6^;}^ = -re, + y/l^e^. (43) 

Observe that ^Jf^ ± 0^^'^ j = 1,2, = ^/T^ = (^i^^^i^^) and = r = 

-{e^u\e^;}^). Also, by Al, G Gf (Ci, . . . , Cm), for j = 1, 2. 
Let Pj = 7V®"(0,S(j)). Then 

i^(Pi,P2) + A) = n[h{\,){l - Ke«,e(^))P) + /.(A.)(l - K0«,0(2))n 

- ^(A;.^(A.) + A.r?(A^)){K0«,0(2)^|2 ^ 1(0(1), 

= n[(/i(A^) + h{X,)y - ^(A^r/(A,) + A,7?(A^))r2] 

= n5(A^,A,)r2. (44) 
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Apply Lemma 2 for testing Pi against P2- Define Pmis = iTifxiaT V fSj-) and observe that 
Pmis < T^mis < "^Pmis- Since the lower bound in p3]) is symmetric w.r.t. -Kmis, and vTmis is 
symmetric w.r.t. Pi and P2, it follows that 

ng{X^, A,)r2 = K(Pi, P2) + i^(P2, Pi) > -2 log(^^i,(2 - vr^,)). 

This implies that 

p-^gi^ij.,^i^)r^ < . (2 — TT ■ ) < 2-7r ■ < An ■ 

Since, L{9W^0(^)) = 2(1 - VT"^) > r^, and = use ([28]) with Jq = {e^^\e'^^^} 

and 5 = to get, 

sup EeL{9^X) > ^ ^ 



eeef (ei,...,6iM) »eng(A^,Ai.) 

Now, let n vary over all the indices 1, . . . , z/ — 1, z/ + 1, . . . , M and the result follows. 

In the situation where <5,i 7^ 0, as n — )• 00, simply take n (/ u) to be the index for which 
gi^fi, ^u) is minimum. Then apply the same procedure as in above with r G (0, Co) fixed. 

7 Proof of Theorem 1 

We require two main tools in the proof of Theorem 1 - one (Lemma 5) is concerned with the 
deviations of the extreme eigenvalues of a Wishart(A^, n) matrix and the other (Lemma 6) 
relates to the change in the eigen-structure of a symmetric matrix caused by a small, additive 
perturbation. Sections 19.11 and 19.21 are devoted to them. The importance of Lemma 6 is that, 
in order to bound the risk of an estimator of Oj^ one only needs to compute the expectation 
of squared norm of a quantity that is linear in S (or a submatrix of this, in case of ASPCA 
estimator). The second bound in ()132p then ensures that the remainder is necessarily of smaller 
order of magnitude. This fact is used explicitly in deriving ([66|) . 

Remark : In view of Lemma 6, HyiTj) becomes a key quantity in the analysis of the risk of 
any estimator of Op. Observe that, 

1 1 

H,:=H,{T.)= V e,,el,-—{i-y^e,,el), u = i,...,m. (45) 

l<.'^u<M " ^'^ ^'^ u'=l 

Expand matrix S as follows. 

II ||2 / I 1 \ 

/x=l /i=l ^ ^ 



+ E ^^^V^^.^M' + -ZZ^- (46) 
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In order to use Lemma 6, an expression for H,yS9,y is needed. Use the fact that H^Oi, = and 
9j^6^ = (Kronecker's symbol), to conclude that 



n 



n 



(47) 



Further, from (|l5|) it follows that, H,y9f^ = ^ 9^, if ^ / z^. Also, 

M 

H^Zv, = - — {I -^9^9j;)Zv, + >^^;3^(Zi;„e^)0^, 



and 



1 ^ 1 

H,ZZ^9, = -^{I-Yl ^M^^)ZZ^e. + ^ ———{Z'^9^, Z^9, 



(48) 



(49) 



From gT]), (08]) and (09]), it follows that 



HlyS9l, 



Er4 



\a-{Zv^,9y) + ^/\'J-f^ZVy,9^i , 



— (/ - ^ ^^.OzZ^e, - ^=(7 - 0^0j)Z^,. 



(50) 



Let r be an iV X (iV - M) matrix such that F^F = /, and FF^ = (/ - Ejii ^/^^/T)- Then, 
T9^ = for all /u = 1,. . . ,M. 

A crucial fact here is that, since has i.i.d. A^(0, 1) entries, and is independent of Z, for any 
D G M™^", DZj^^ has a Nm{0, DD^) distribution, and is independent of Vf^. Furthermore, 

since 9^ are orthonormal, and F0^ = for all /i, it follows that Z'^9^ has a A^„,(0, /) distribution; 
{Z^6'^}^'£]^ are mutually independent and are independent of FZ. 

Next, we compute some expectations that will lead to the final expression for E || HyS9y |p. 

K i^/\-{Zv^,9,) + ^/K-{Zv,,9^] 
\ n n 

^ ^^E((Z^;^, 9,)f + A,E((Z^;,, 9^)f + 2^/\Xm'^v^, 9^){Zv,,9^))\ ' 



n 



(51) 
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since the cross product term vanishes, which can be verified by a simple conditioning argument. 
By similar calculations, 

2 



\ n n J 



KK + 1 



n 



(52) 



and 



^{^/\.-{Zv^,e,) + ^/K-{Zv,,d^)] (^/\X-{v^,v,) + -{z^e^,z^e,)] =o. (53) 

\ n n J \ n n J 

Since trace{TT'^) = N — M, from the remark made above, it follows that, 

M 

E II (/ - ^ e^fil)ZZ^du f = ¥.[{elZ)Z^TT'^Z{Z^du)] = n{N - M), (54) 

M 

(I -^&i^Gl)^v^f = ^\\vuf'^\\T^Zj^^^f=n{N-M), (55) 
/i=i 

and 

M M 

E{{I - 0^Oj;)ZZ^e,, {I-Y, = E[t;JZ^rr^Z(Z^0,)] = O. (56) 

Use (f50|) . and equations (jST]) - (j56|) . together with the orthonormality of 0^'s and the fact that 
T9^ = for all to conclude that, 

E||i..S..f^j^ + lE "!/-"'V< (57) 

nh[K) (A^ - KY 

The next step in the argument is to show that, maxo<^<A/(A^ — A^+i)^"*^ || S — S || is small 
with a very high probability. Here, by convention, Aq = oo and Aa^+i = 0. From (j46]l . 

M „ ,,9 M 



n — ' n 



+ E V^l^^^l+ II -ZZ^ - ^ II ■ (58) 



Define, for any c > 0, Di n{c) to be the set 

*^ II „, Il2 



' ' n V n 



{| ^^/-'V) |<e./M^}. (59) 
I I I I ^ V n 

l<^t<^'<A/ 
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Use Lemmas 14 and 15 to prove that, 

1 - P(L'i,„(c)) < 3M(n V N)-"^ + M{M - l)(n V Ar)-|c'+0{iog(nviV)/n)_ ^gg) 
Define Do n\C) as 

^2,n(c) = {II -ZZ^ _ / ||< 2 J- + - + Ct„}, (61) 
n \ n n 

with t„ as in Lemma 5. From (j58p . ()60p and (jl23p . it follows that for n > Uc, 

P(|| S-S ||> e„,7v(c,A)) < l-P(Z)i,„(c)nD2,n(c)) 
< (3Af + 2)(n ViV)-^' +M(M- l)(n V7V)-i^'+^('°s("^^)/"), (52) 

where 



/ , X /V^ , X / log(n V N) 



n — • \ V n /\ i\ I S n 

/j=i /i=i 



Define 



n 

l<At^^t'<Af 



(1 + 


'log(n V TV) 


n^N 




N 




+ — + Ctn. 


V 11 


n 


~^}en,iv(v 


/2,A), 



(64) 



and observe that 5n,N,u — >■ as n — 00 under LI and L2. 
To complete the proof of ^ , write 



- sign{ele^)e^ = -h^so^ + r^. (65) 

Since 6n,N,u 0, by (fT32|) . (fT30|) . (fml) and §2\), and the fact that < A^, for sufficiently 
large n, on I)i,„,(\/2) n D2,niV2), 

II /7,S0, f (1-5;,^,J2 <|| ^^g^^ ||2 (i + ^/^^^ j2^ (gg) 

where 

— -2 [1 + 2(1 + 5n,N,.)il - 25n,7V,.(l + 25„,7V,.))], (67) 
(1 - 2c)n,Af,;/(i + 20„,Ar,,y)j^ 

and as n oo. Since 1(9^, 6^) < 2, ([62]), ([M]) and dS?]) together imply (P. 



8 Proof of Theorem 3 

In some respect the proof of Theorem 3 bears resemblance to the proof of Theorem 4 in John- 
stone and Lu (2004) . The basic idea in both these cases is to first provide a "bracketing relation" . 
This means that, if I„ denotes the set of selected coordinates, and /„ and In are two non-random 
sets with suitable properties, then an inequality of the form P(/„ C I„ C /„) > 1 — 6n holds. 
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where 6„ converges to zero at least polynomially in n. Once this relationship is established, one 
can utilize it to study the eigen-structure of the submatrix f of S. The advantage of this 
is that the bracketing relation ensures that the quantities involved in the perturbation terms 
for the eigenvectors and eigenvalues can be controlled, except possibly on a set of probability 
at most bn- 

The proof of Theorem 3 follows this principle. However, there are several technical aspects in 
both the steps that require much computation. The first step, namely, establishing a bracketing 
relation for In, is done in Sections 18.21 - [8T6l The second step follows more or less the approach 
taken in the proof of Theorem 1, in that, on a set of high probability, an upper bound on 
L{0y,6y) is established that is of the form || HyS20y |p (1 + where Hy is as in (jl5]) . S2 is 
the matrix defined through equation (j93|) . and (5„ — )• 0. Then, by a careful examination of the 
different terms in an expansion of H^S20u, it is shown that an upper bound on E || H^S20i, |p 
is asymptotically same as the RHS of (j24p . This is done in Section 18.91 Some results related 
to the determination of correct asymptotic order of the terms in the aforementioned expansion 
are given in Section 19.51 Before going into the detailed analysis, it is necessary to fix some 
notation. 



(68) 



8.1 Notation 

For any symmetric matrix D, \k{D) will denote the k-ih largest eigenvalue of D. Frequently, 
the set {1, . . . , A^} will be divided into complementary sets A and B. Here A may refer to the 
set of coordinates selected either in the first stage, or in the second stage, or in a combination 
of both. S will be partitioned as 

^AA ^AB 
Sba ^BB 

where Sab is the submatrix of S whose row indices are from set A, and column indices are 
from set B. Any x 1 vector x may similarly be partitioned as x = (x^ : y^)"^. And for an 
N X k matrix Y, and Y^ will denote the parts corresponding to rows with indices from set 
A and B, respectively. It should be clear, however, that no specific order relation among these 
indices is assumed, and in fact the order of the rows is unchanged in all of these situations. 
Expressions like (j68p are just for convenience of writing. 

8.2 Bracketing relations 

In this section the bracketing relationship is established. The proof involves several parts. It 
essentially boils down to probabilistic analysis of 1° - 5° of the ASPCA algorithm. This is done 
in several stages. The coordinate selection step in 1° and 2° are jointly referred to as the first 
stage, and steps 3°, 4° and 5° are jointly referred to as the second stage. 

8.3 First stage coordinate selection 

In this section 1°, i.e., the first stage of the coordinate selection scheme, is analyzed. Define 

M 

Ck = Y,XA, k = l,...,M. (69) 

u=l 
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For < a_ < 1 < a+, define 



lL = {k:(,>o.,-n^M^}. (70) 



n 



It is shown that satisfies the bracketing relation ()74p . 
Let cr^ := Cfc + 1- The selected coordinates are 



h,n = {k:S,,>l + ,,{-^^i^}. (71) 
Note that, Skk ~ ^fcX^„)/"- Then, 

P(/i-„ {Z: /i,„) = P(U,g,- JSfcfc < 1 + 7i,„}) < Yl ^^^kk < 1 + 7i,n) 

- ^ ^ ^2 - 1 



fc l + a+7i,n 

2 

< lA,nF( 1< — 7— )' ( Since, Sfcfc ~ CTfcX(„)/n ) 

< i-frjiiWiij-W"*-")''")"*"")). (72) 

Similarly, if n > 16 then, 

P(/l,„ Iln) = IP(Ufc^/+ JSfcfc > 1 + 7i,„}) < ^(Sfc'^ > 1 + ^1.") 

< y > AlJlf^) < Arp(^ _ 1 > 

^ ^2 1 + a_7i,„ ' n 1 + a_7i,„ 

< Ar(Arvn)-(^?(i-'^-)'/^)(i+°W). (73) 

Combine (f72ll and (f73]l to get, as n — )• oo, 

l-P(/l-„C/i,„C/+J 
< |/-J(Ar Vn)-(^?('^+-i)'/4)(i+<'«) +iV(iV Vn)-(^?(i-'^-)V4){i+o(i)). (74) 

For futm'e use, it is important to have an upper bound on the size of the sets I^^. To this end, 
let c = (ci, . . . , cm) be such that Cjy > for all v and ^1^=1 ^ — ^■ 

M M . 

Itn = {ke{l,...,N}:Y, KOlk > aT7i,n} C lj{k e {1, . . . ,N} : \e,k\ > ^J^^}- 
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Since G Gf (Ci, . . . , Cg), and /«(C7) ^ wl'^{C), it follows from above that, 



M 



n 



q/4 



(log(A^ Vn))9/4' 



(75) 



In fact, the upper bound is of the form Ji,n(c, 71, flzp) A N, since there are altogether 
coordinates. Set c = (Af ^-"^Z^, . . . , M^^/^), and denote the corresponding Ji^„(c, 71, a=p) by 
Ji_„(7i, a=p). Whenever there is no ambiguity about the choice of 71 and azp, Ji,n(7i;a=F) ^^^^ 
be denoted by J^^. Notice that CI and C2 imply that J^^ — >■ 00 as n — >• 00. And C3 implies 



that 



nh{Xi) 



0. 



Remark : From now onwards, the set {/^ „ C /i,n C Iin} will be denoted by Gi^n- Observe 
that Gi^n depends on 9. However, from ([7^ . it follows that, if 71 = 4, a+ > 1 + and 
< a_ < 1 — then there is an eo > and an no > 1, that depend on a+ and a_, such that 
for n > no, 

P(Gf J<(iVVn)-i-^o, (76) 



uniformly in ^ G Gf (Ci, . . . , Cm) 



8.4 Eieen-analysis of Sr r 

Throughout we follow the convention that {ey,9^j^ ) > 0. Define 



Si := 



h.nll,'. 

o 



o 
o 



Si := 



Il,n,I\, 

o 



o 
I 



(77) 



Let efc be the eigenvector associated with eigenvalue 1^ of Si, for /c = 1, . . . , mi, where mi = (nA 
|/i^„|). Eigenvalues of Si belong to the set {^1, . . . , £mj}U{l}; and the eigenvector corresponding 
to the eigenvalue £fc is e^, 1 < k < mi. Note that, 1^ is not necessarily the k-ih. largest eigenvalue 
of Si. However, the analysis here will show that this happens with very high probability for 
sufficiently large n. 
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Let = 6( J+Jn V 1) Jlog(n V J+J/(n V J+J. Define, 



2v/2^^ /log(nVJ+J 



Ai 



i/=i 

M 



n 



2 / 
^^^1 + 2^2, 



i/=i 



Ai V n 



Ai 



n n 

M 



log(nVJ+ )\ /J+ 



n 



1-9/2 

Cga+ 7i 



1 12^ A, AinV2-g/4 



i/=i 



where Cq = 2^^- Observe that, under conditions C1-C3, maxi<j<5ej^„ — >■ as n — >■ o 
Set A = A+ = /+„, B = If ={1,...,N}\ and define 



and 



M II 



M 



n n {^^±^< 11+2V2, 



n 



n n {i 

l<i^<i^'<M 



n 



n 



log(nV J+ )\ /J+ 



n 



} 



Then the following results hold. 

Lemma 3: Under conditions C1-C3, 

3 5 
[]G,,n C {|A,(Si)-(l + A.)|<Ai^ 

P((G2,n n Gs^nT) < ^M{n V J+„)-2 + M(M - l)n-=^+°(^) + 2(n V J+„)- 
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Lemma 4: Let h^n = 6(|If„|/n V 1) Jlog(n V |/i V |Ii Under conditions C1-C3, 



M+l 



> (1 + Y %^)' + ^i,n: h,n C /+J < 2(n V |/+J)-2. (83) 



Remark : Let G4,„ = {^Af+i < (1 + V %^)^ + \/2ti,„}, where ti, 

n is as in Lemma 4. Observe 

that G4,„ depends on 9; however, P(Gi,n n Gl J < 2n~'^ for all 6 G Qg^{Ci, . . . , Cm). It is easy 
to check that, under C1-C3, 



/j+ j+ 

2V — + — = o(Ai) as n ^ oo. (84) 
V n n 

Therefore, from Lemma 3 and Lemma 4 it follows that, for sufficiently large n, uniformly in 
9 G Qq^{Ci, . . . , Cm), 

5 

-fl + A,.)l > a/ 

l<u<M 



P( max I?,- (1 + A,) I >AiVe,,„, Gi,„) < Ki(M)n-2, (85) 



for some constant K\{M) that does not depend on 9. 

8.5 Consistency of M 

Proposition 2: Under conditions C1-C3, and with a„ defined through i20\}. M is a consistent 
estimator of M . In particular, if = 9, 7^ = 3, then there are constants a+ > 1 > a_ > 0, 
1 > a' > 0, and an n^,o such that for n > n=KO, uniformly in 9 £ 0g^(Ci, . . . ,Cm), 

P(M / M) < K2{M)n~^~^\ (86) 
/or some constants K2{M) > and ei := £1(71,7^,0-1-, a') > independent of 9. 

8.6 Second stage coordinate selection 

Steps 4^^ and 5*^ of the ASPCA scheme are analyzed in this subsection. For future reference, 
it is convenient to denote the event 0^=1 Gj,n H {M = M} by Ci^n- The ultimate goal of this 
section is to establish ([92]) . Throughout, it is assumed that BA and C1-C3 are valid. Observe 



that, by definition (see 4° and 5° of ASPCA scheme), Tk = Ylt=i ^fc^ ^ ^ ^i-"' ^^"^ define it 
to be zero otherwise. 

8.6.1 A preliminary bracketing relation 

First, define 

M 

Ck = Y.h{X,)9lk, k = l,...,N. (87) 

u=l 
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Define, for < 72,- < 72 < 72, 



r± r, 7 2 log(iVVn), 
lt = {k:Ck> 72 T 



Observe that C,k > ^?(AM)Cfc- This implies that, for some n=Ki > n*o V n'^gi ^ 
I+i C I~, uniformly in 6* G 0*^(Ci, . . . , Cm)- Note that 

P({/- C U /2,„ C Z+}^ < P(I- U /2,„, + P(/i,„ U /2,„ t In^Gl,n)- 

In the following, D is a generic measurable set w.r.t. the ci-algebra generated by Z and 
wi, . . . , vm- Then, for n > n.,,!, 

P(/l,„ U l2,n (t- In, Gl,n D D) = P(U^^^+{A; G U /2,n}, 

= P(Ufc^^+{A; G /2,„ n TlJ, Gi,n DD) < ^ P(A; G /2,n n n D) 

= ^ p(rfe >72%, Gi,„nz)), (89) 

where the last equality is from the inclusion C I^^ ^ In ^ In ■ Similarly, 

P(/- h,n u /2,„, DD) = nu^^j-ik u /2,„}, n 
= P(u,g,-\,- Jfc G If,, n /£ J, n z?) < P(A: /i,n, k i2,n, n d) 

= P(Tfc <7l,n,fc0?i,n, Gi,„nD). (90) 

8.6.2 Final bracketing relation 

It can be shown using some rather lengthy technical arguments (provided in the technical note) 
that, given appropriate 72, 72,+ and 72,-, for all sufficiently large n, except on a set of negligible 
probability, uniformly in 6* G Qg'^{Ci, . . . , Cm), 



Tk < 72,n if k^I+, 
Tk > lln if k£ I-\ /{"„. 



(91) 



Once ([9T]) is established, it follows from ([89|) . ([90]) . and some probabilistic bounds (also given 
in the technical note) that there exists n^g such that for all n > n^Kg, 

P(/- C h,n U /2,„ C /+, Gi,„) > 1 - i^6(M)n"i-^^("), (92) 

for some KeiM) > and e2(K) > 0. Moreover, the bound ([92]) is uniform in 6^ G Of (Ci, . . . , Cm). 
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8.7 Second stage : perturbation analysis 



The rest of this section deals with the part of the proof of Theorem 3 that involves analyzing 
the behavior of the submatrix of S that corresponds to the set of selected coordinates. To begin 
with, define I„ := /i,n U l2,n, and G^^n '■= {^n ^ /„ C I^} n G2,„. Then define 



o o 



In , In 

O 



O 



(93) 



In this section A wih denote the set In, B = {I, . . . , N} \ A =: A"", A± = I^, ^_ = \ A, 
= {1, . . . , N} \ A^ =: A'L- The first task before us is to derive an equivalent of Lemma 3. 
This is done in Section [8.81 The vector HyS20y is expanded, and then the important terms are 
isolated in Section 18.91 Finally, the proof is completed in Section 18.101 



8.8 Eigen-analysis of S2 

u = 1,... ,M are the eigenvectors corresponding to the M largest (in decreasing order) 
eigenvalues of 82- As a convention {9^, 6u) > for all 1/ = 1, ... , M. Let the first M eigenvalues 
of S2 be ^1 > . . . > H-M- Then arguments similar to what are used in Section [8.41 establishes the 
following results. 

On G3 „, for all /Li = 1, ... , M, 



Q ||2^ —2 



71, /X 



and 



2-q 

Cg72,+ 



M 



(n/i(A^))i-'?/2 ■ 



\it\ < jL ■■= 72:^M^/'(E K^y'^c^) 

Under C1-C3, as n — )• 00, for all = 1, . . . , M, 



n 



log(n V N) 



q/2 



7± 

•^2,n 



nh{K) - - xl {nh{K)Y~^/^ 

and Tn := maxi<^<M 7^n,/i 0. Again, check that |I^| is bounded by and || O^^j-y |p 
is bounded by 7|_,_A^log(n V N){nh{X^))~^ . This observation leads to the fact alluded to in 
Remark 15.21 

For j = 1, . . . , 4, define Ej^n as Ej^n is defined in (j78|) . with J^^ replaced by J^„. Then define 



A? (E;ili(^)'^/'C^^)aog(nViV))-^/2 



0; 



(94) 



(95) 



(96) 



e5,n = Cq72 



1-9/2 



M 

E 



^(Ai) 



l-q/2 / , ^ g/2 

Ai 



It follows that maxi<j<5ej_„ — )• as n — )• 00. Define 



CI 



log(n ViV )V~'^/^ 
nh{\i) 



(97) 



Ai 



max{A,^_i - Xu, X^ - Xu+i} 



E^j> + 



M 



\^a7^ 



(98) 
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and A„ = maxi<,y<j\,/ ^n,v A result that summarizes the behavior of the first M eigenvalues 
of S2 can now be stated. 



Proposition 3: There is a measurable set G^^n C G^^n, oind an integer n-tj > Hj^q, such that, 
for all n > n^-j the following relations hold, uniformly in 6 & 0*^(Ci, . . . , Cm)- 

G4,„ C f|{4 = A,(S2) and 14- (1 + A,)| <Ai^ (99) 

u=l j=l 
5 

Gi,n C {II S2-S||< Ai(^ej> + 



\ Ai=l ^ 

1-P(G4,„) < K7(M)n-l-^^ (101) 

/or some constants Kj{M) > and €3 > 0. £3 depends 0/71, 7^, 7^ a±, 72, 72,±; o^nd k. 

At this point it is useful to define a quantity that will play an important role in the analysis 
in Section [8.91 Define, 

*-=^- + ij^S^^p^. f=V...M. (102) 

Then define = niaxi<^<A./ and observe that, under C1-C2, — )■ as n — )■ 00. 

We argue that, for n > n^g, say, on a set G^^n with probability approaching 1 sufficiently 
fast, 

<|| H^SiO^ f (1 + 

where 6n,N,i/ = o(l). Therefore, it remains to show that, E || HyS20i, |p 1^^ is bounded by 
the quantity appearing on the RHS of (i24|) . 

8.9 Analysis of 7^^8261^ 

In this section, as in Section 18.101 v is going to be a fixed index in {1,... ,Af}. Before an 
analysis of H^S29u is carried out, a few important facts are stated below. Here G is any subset 
of {1, . . . , iV} satisfying A_ C C. 

IV - i^f^,C, 0^,c)\ = \{0,^,C^O^,C^-)\ < Tn^^^Tn^u < {^n,^, V ^?„,;,)^?„, (104) 

max -1;^ < ^tH^L- (105) 
l<^l<M nh{Xf,) h{XM) ' 

Further, 



/ logn ^_ / logn logn-yj2+„ 
i<Ai,M'<Af y nh{\^') y nh{\M) nh{\y) ^ ■ ' 

which follows from CI, C2, ([M]), ([Ml), and (fT02]) . 
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Next, observe that H^O^ = implies that 



Hu^Aa{^AA - I)0u,A 
Hu,Ba{SaA - I)Ou,A 



say. 



(107) 



Then and have the general form, for C = A,B 

M 



M II 

V 



c 



n 



M ^ 

,CA^p.,A + \/\t (^u,A)Hy^cA(^ , 



7fl,A 



)H,,ca-Zav,+ A^^ii^yvV {(^fi',A, Oi^,A)Hufi'Adfj.,A 



1 



n 



(108) 



When C is either yl or and 5c a is 1 or according as whether C = A or not, 

M 



Hu,Ca(^ij.,A 



^l,A)^'^',c 



1 



—{^cA- II 6*/^,^ lP)6'/i,c; 



(109) 



Hu,CA'Z'AVfj. = ^— p(Za'U^,6'i/',a)^'i/',C 



1 *^ 

— (5caI - 6^i^cQu',a)'^av^i\ 
Ay 



(110) 



l/'=l 



-f^!^,CA(-ZAZ5 - /)6',y,A = ^ ( -{'^A(^A,u','ZA(^A,iy) - {Ou',A,0iy,A)} Oyi c 

1 1 

-t-{ScaI - Ou',cO,',a){-ZaZI - i)e,^A. ( 



111) 



A further expansion of terms a and can be computed, but at this point it is beneficial 
to isolate the important terms in the expansion. Accordingly, use Lemmas 9 - 13, together 
with (|104|) . (|105|) and (|106p to deduce that, there is a measurable set G^^n C G4^„, constants 
Ks{M) > 0, £4 > and an n^g > n^,7 such that, 1 - P(G5,n) < KsiM)!!'^'"-^ , for n > n^g, and 

^' = ^-0 + ^'7 + + + ^'/y + ^-rem, (112) 

where || ^'rem II < bn'&n,u, with bn = 0(1), and the other elements are described below. 

^-0,^ = and ^o,B = 0u,B- (113) 
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= Yit+y Wy^yQ^ where w^,y equals 



1 ^ 1~~ 

ti=l 

where Zia_ = 'Zia_ and Zai = O, i.e. a matrix whose entries are all 0; and H is a iV x A'^ matrix 
whose {A-,A-) block is identity and the rest are all zero. 

1 1 ~ 

*m = -^{1 -Y,e^el)-T.v,. (116) 

^'/y is such that ^ly^A- = 0, ^ly^B = 0, and 

^WA- = -\^A- (^«- + ^ZL^.,A_) . (117) 

8.10 Completion of the proof of Theorem 3 

Suppose without loss of generality that n*8 in Section 18.91 is large enough so that A„ < ^"-^ . 
Since on G^^n-, || S2 — S ||< min{Ajy — Ajy+i, — Ajy}A„, where Aq = 00 and Ajvf+i = 0, argue 
that, by Lemma 6, for n > n^s, on G^^n, 

<|| H^S2eu f (1 + (118) 

where 6n,N,u = o(l)- Therefore, it remains to show that, E || HyS20u |P l^j, is bounded by 
the quantity appearing on the RHS of ()24p . In view of the fact that, this upper bound is within 
a constant multiple of "i?^,^, and || ^rem ||= o{'dn,v) on G^^m it is enough that the same bound 

holds for E II ^ - ^rem IP 1?^ • 

Observe that ^7, ^'/z, and m are mutually uncorrelated vectors. Also, by E || ^0 P 
Igs,, < T'n,!.- Therefore, 



E II ^ - ^>rem f % 



< tI^^ + E II ^7 f +E II f +E II ^iii f +E II -^iv f 1g5,„ 

+2E|(^o, ^/ + -^11 + ^m)|lG5,„ + 2E|(^'7y, ^-q + ^-7 + ^-z/ + ^m)|lG5,„ (US) 

Observe that, ^uA- = -^(^ " EJIi e/.,A_eJ,^_)(iZA_Z^_ - /)0,,^_, 



1^/1 \ 
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and 

1 ^ 1 1^1 

^iii,A- = — 7^{I -^Oi,,a.OJ^^_)-Za_v^, ^iiiAt = -J^^-{'^A.Vu,9^,^A.)0^,,A'L■ 

Thus, by a further apphcation of Lemmas 9-12, it can be checked that, there is an integer 
?^*9 > ?^*8) and an event Ge_„ C G^^n such that, for n > n^g, on Ge_„, 

1(^-0, ^/ + ^-/I + ^777)1 + K*/V,^'0 + ^Z + ^-Z/ + I < 6'n<., (120) 

with 6; = o(l); and P(G6.„nG5,„) < n^^Ci+^s)^ for some constants ^"9(1^) > and £5 > 0. On 

1 „ „ /I l^T 



^IV lr< ^ II Z . , ^t;^ + — Zi_0^,A 



n2 " ^+/- vva; a, 

where := \ j4_, and the (unrestricted) expectation of the random variable appearing 

in the upper bound is bounded by ^^/[pl^- From this, and some expectation computations 
similar to those in Section [3 deduce that, 

tI, + f +E II ^11 f +E II ^iii f +E II ^iv f < ^n,.(l + 0(1)). (121) 

Finally, express the event G^^n as (disjoint) union of G^^n H G^^n and G^^n H Gg„; apply the 
bound (jl20p for the first set, and use Cauchy-Schwartz inequality for the second set, to conclude 
that, 

Eli^o,"^! + "^11 + *m> I 1g5,„ + IE| (^-zy , + + + ^iii) I 1g5,„ 
< 6;<, + 2/^9(M)^"'''^^?n = o«J. (122) 

Combine (fTT9]l . (fml) and (fT22]) to complete the proof. 

9 Appendix 

Some results that are needed to prove the three theorems are presented here. 
9.1 Deviation of extreme eigenvalues 

The goal is to provide a probabilistic bound for deviations of || ^ZZ-^ ~ II- This is achieved 
through the following lemma. 

Lemma 5: Let tn = 6(^ V ^)^J~^^J^■ Then, for any c > 0, there exists > I such that, 
for all n > TLc, 

P I II -ZZ'^ - / ||> 2W — + — + ct„ 1 < 2(n V Nr''\ (123) 
\ n \ n n j 
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Proof : By definition, 

II -ZZ^ - / 11= max{Ai(-ZZ'^) - 1, 1 - A7v(-ZZ^)|. 

n n n 

From Proposition 4 (due to Davidson and Szarek (2001)), and its consequence, Corollary 2, 
given below, it follows that, 

F I II -ZZ^ - / ||> 2a/— + — + ctn] 
< expf -cX \ f ncHl \ 

- "^y 8{ctn + {1 + y^r) J V 8(ctn + (1 - v^)')y 

First suppose that n > N. Then for n large enough, ct„ < |, so that 

> and ^^-^ — > 



8(ct„ + (1 + ^/Nj^Y) - 36 ' 8(ct„ + (1 - v^)2) - 12 

Since in this case nt^ = 36 log n, (fT23]) follows from (fT2i|) . If iV > n, then AAr(iZZ^) = 0, and 



8{ctn + (1 ± v/W^)') 8(c^tn + (1 ± aAV^)') ' 
and therefore, (jl23p follows if the roles of n and are reversed. 

Proposition 4: Xei Z be a p x q matrix of i.i.d. N{0, 1) entries with p < q. Let Smax{Z) and 
SminiZ) denote the largest and the smallest singular value of Z, respectively. Then, 

P(w(^^) > 1 + V^ + i) < e"'*'/', (125) 

FiSmini^Z) < 1 - VpT^ -t) < e-«*'/2. (126) 



Corollary 2: Let S = ^ZZ'^ where Z is as in Proposition 4, with p < q. Let mi{p,q) := 

(1 + Y^)^ ^p{PjQ) •= (1 ~ ' -^i(^) '^^^ \{^) denote the largest and the smallest 
eigenvalues ofS. Then, for t > 0, 

i{S) - mi{p,q) > t) < exp (^-|(\/t + mi(p, g) - y/mi{p, q)f^ 

and 



^ + ''''' 



P(Ap(S) - mp(p,g) < -t) < exp ( --(yt + mp(p,g) - 'y/mp(p,g))' 

qt^ 
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9.2 Perturbation of eigen-structure 

The following lemma is most convenient for the risk analysis of estimators of 0^. Several variants 
of this lemma appear in the literature (Kneip and Utikal (2001), Tyler (1983), Tony Cai and 
Hall (2005)) and most of them implicitly use the approach proposed by Kato (1980). 

Lemma 6: For some T G N, let A and B he two symmetric TxT matrices. Let the eigenvalues 
of matrix A be denoted by Ai(^) > . . . > Xt{A). Set Ao(^) = oo and Xt+i{A) = — oo. For any 
r G {l,...,r}, if Xr{A) is a unique eigenvalue of A, i.e., if \r-i{A) > Xr{A) > Xr+i{A), then 
denoting by Pr the eigenvector associated with the r-th eigenvalue, 

Pr{A + B)- sign{pr{A + 5)^p^(yl))p^(A) = -Hr{A)Bpr{A) + Rr (129) 

where Hr{A) := ^^.^^ ^ {A)-x {A) -^^si^) '^^^ PSsi^) denotes the projection matrix onto the 
eigenspace Eg corresponding to eigenvalue Xs{A) (possibly multi- dimensional) . Define A,, and 
Ar as 

Ar := ^[|| Hr{A)B II +|A,(^ + B) - A,(^)| || Hr{A) \\] (130) 

II ^ II 

^'^ mmi<j^r<T\^j{A) - Xr{A)\ ' ^"^^"^^ 

Then, the residual term R can be bounded by 

?A n -J-9A II U ( A\Rr> ( A\ II 1 

} (132) 



r ^2 / .X „ r 2Ar(l + 2A^) \\ Hr(A)Bpr(A) 

i?, ||< min{10A„ \\ Hr{A)Bpr{A) \\ , Z , . + " ^ ^ ' 



1 - 2Ar(l + 2Ar) (1 - 2Ar(l + 2A^))2 
where the second bound holds only if A^ < . 

9.3 Proof of Proposition 1 

Proof : For n i.i.d. observations Xi,i = 1, . . . ,n, the KL discrepancy of the data is just n 
times the KL discrepancy for a single observation. Therefore, w.l.o.g. take n = 1. Direct 
computation yields 

M 

S-^ = (/-5^r?(A,)Mj). (133) 
Hence, the log-likelihood function for a single observation is given by 
\ogf{x\e) = -:|log(27r)-^log|E|-^x^S-ix 

M M 

= -_log(27r)--^log(l + A.)--((a;,a;)-5^r7(A.)(a;,^.)2). (134) 

u=l u=l 

Recall that, if distributions Fi and F2 have density functions /i and /2, respectively, such that 
the support of /i is contained in the support of /2, then the KuUback-Leibler discrepancy of 
F2 from Fi, to be denoted by ^(^1,^2), is given by 

K{FuF2) = I log|M/i(y)a!y. (135) 
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Hence, from (jl34p . 



E,a)(log/(X|0«-log/(X|0(2)) 

M 




^E^(^^)[(ii ^^-'^ II' - II ^^'^ 11')' + E v{ii f 




which equals the RHS of (j26p . smce the columns of 0^-^^ are orthonormal for each j = 1,2. 



9.4 A counting lemma 

Lemma 7: Suppose that m, N are positive integers, such that m ^ oo as N ^ oo and 
m = o{N). Let Z be the maximal set of points in satisfying the following conditions: 

(i) for each z = (zi, . . . , z^r) ^ Z , zi ^ {0, 1} for all i = 1, . . . , N , 

(a) for each z ^ Z, exactly m of coordinates of z are 1, 

(Hi) for every pair z and z' in Z , Zi = z\ for at most [^] =: k{mo) — 1 (i.e. fc(mo) is the 
largest integer < mo/2 + 1) nonzero coordinates, where mo = [/3m], for some j3 E (0, 1). 

Then cardinality of Z is at least exp([A^<S(^) — 2m£{^)]{l +o(l))) where £{x) is the Shannon 
entropy function. 

Proof : Trivially, Z <Z Z* ^ where Z* is the set of all points z satisfying (i) and (ii). Thus, 
\Z\ < \Z*\ = (^). On the other hand, for every point z G Z* there are at most 



points w G Z* such that at least A:(mo) nonzero coordinates of z and w match. This is because, 
one can fix the m nonzero coordinates of z and demand that in A; (mo) of those coordinates Wi 
must equal 1. Other m — fc(mo) nonzero coordinates of w can therefore be chosen from the rest 




39 



N — k{mo) coordinates. Then, by the maximality of Z, as N ^ oo, 



\Z\ > 



m 



g{N,m,mo) ^ 
N\ k{moy.{m - k{mo))\ {m - k{mo)y.{N - my. 



{N - m)!m! 

m 



{N-k{mo)y 



/ fe(mo)!(m — k{mo)y. 



k{moy.{N - k{mo)y \ 



ml 



m[N — k{mQ))'^/^ k[mo) N — k{mo) 

[(M!^)fc(mo)( "^-^("^o) )m-fc(mo)]2 ^^y StirHng's formula) 



exp 



-2mf(^)(l + o(l)) 



(136) 



Where the last equality is because, for large m, 



m 2 ■ 



9.5 Some auxiliary lemmas 

In the following lemmas we provide probabilistic bounds for the deviations of certain quadratic 
forms that arise in the analysis of the residual terms in the expansion of Oy. Many of these 
involve the random sets, either 7i^„ or l2,„, of coordinates that are selected under the ASPCA 
scheme. It will be assumed that the quantities involved are all measurable w.r.t. the joint 
distribution of Z and vi, . . . , vm, though it will not be made explicit in the description or the 
proof of the lemmas. The bounds hold uniformly in ^ G 0^^(Ci, . . . , Cm)- 

Lemma 8: Let e„ > 0. Let A denote the random set In,i, (ind = I^^ and A-^- = I^^. 
Assumx that A^ C \ {k}, for some 1 < k < N. For any subset C of {1, . . . ,N}, let 
Yc := YcCZicV) be a random vector jointly measurable w.r.t. Zc and V = [vi : . . . : vm]- 
Assume that for each C, either ¥v{Yc = 0) = a.e. V , or Py(Yc' = 0) = 1 a.e. V , where Py 
denotes the conditional probability w.r.t. V . Let Wk,c = {Zk, 0; Wk,c = 

otherwise. Then, 

¥{\Wk,A\>en,A_cAcA+\{k}, \\ V \\< Pn) < —H-en), (137) 

On 

where j3n is such that, on {\\V \\< fin}, a.e. V , 

Py(?fcfe<l+7i,n)>an>0. (138) 



Lemma 9: Let A be a random subset o/ {1, . . . , A''} and A- C A^ be two non-random subsets 
o/ {1, . . . , A^}. Let, k± denote the size of the set A±, and 

e„ = a/ ci log n+ II Oy^A'L II Vci logra + 2fe+ log 2, 



40 



for some ci > 0. Then, for alll < ji < M , 

4^-ci/2 



(K|^,^m)I > en, A_ C a C < 



v27r V ci log n 

Lemma 10: Let A, A±, k±, and A- he as in Lemma 9. Let 



(139) 




/c2logn /cilogn + 2fc+log2 

en =11 Gfj^At II (1 + V ^ 



n V ra V n 



where c\,C2 > 0. T/ien 

P (|-(ZL^.._,Z5J,,._>I > en, ^- C ^ C ^) < ^1^^^ + n-/^. 



(140) 



Lemma 11: Let A, A±, k± be as in Lemma 9. Let, t„ = 6(^ V '^)\f^^^^- Let 

+ 2 II e,^. II (1 + ,/E + ,/^) >^'°e" + 2ttiog2 

V n V n V n 
/or some ci, C2 > 0. T/ten f/iere is an n{c2) > 16 such that, for n > n(c2), \/c2/2t„ < 1/2, anc? 

P (^1^ II Z5^^,A 11^ - II 9u,A f I > en, A-CAc A+^ 
o^— C2/2 

< 2n-"^/^ + ^\ + n-"^/^ + 2{n V fc+)-"^/^ (141) 

V 27r\/ci log n 

Lemma 12: Lei A, A;± 6e as in Lemma 9. Let, fJ- ^ i', and for some t > 0, 



^ + (II e,,^. II + II ^,^c ||)(1 + + /£3lggZ^) / c2logn + 2fc^log2 



+ II 6u,At nil ^/i>^f 



II ^.Ac nil II (2^^ + ^ + V^2t.), 

Vra Vra, ra 



where 01,02,03 > and tn is as m Lemma 11. Then, there is 71(03) > 16 swc/t i/tai, for 
n > n{c2,), ^Jcitn < \, and 

p (^^{z\e^,A,^A&,j^,A) - {e.,A^e^,A)\ > en,A_ C ^ C A^ 

< 2n-3'^i/2+0(^)+2ra-'^2/^ + ^ij^^2f= + n-^3/2 + 2(nVA;+)-^3/2_ (^42) 

V 27r\/ci logn 
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Lemma 13: Let A, A±, k± be as in Lemma 9. Let, 
where ci,C2 > 0. Also, suppose that A;+ > 16. Then, 



\^ II Z^_z£e^3_ II > en,A_ C ^ C ^+) < + ^^43) 



9.6 Deviation of quadratic forms 

The following lemma is due to Johnstone (2001b). 

Lemma 14: Let x^^^ denote a Chi-square random variable with n degrees of freedom. Then, 

nxln)>n{l + e)) < e-3-'/i6 (0 < e < i), (144) 
P(X^„) < n(l - e)) < e— (0 < 6 < 1), (145) 



P(X(„) > n(l + e)) < -^e""^'/^ (0 < e < n^/^^,n > 16). (146) 



The following lemma is from Johnstone and Lu (2004). 

Lemma 15: Let yii,y2i,i = 1, . . . ,n be two sequences of mutually independent, i.i.d. N{0,1) 
random variables. Then for large n and any b s.t. < b <^ ^Jn, 

P(|- V?/uy2i| > VV^) < 2exp{-^ + 0(n-i62)j_ (^47) 
n 2 
1=1 
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